Atomic migrations #7909

chrismwendt · 2020-01-21T01:06:57Z

tl;dr this makes migrations atomic, which ensures that successful migrations always leave the DB in a clean state.

Prior to this change, it was possible for the frontend to disconnect from Postgres halfway through a migration (e.g. when the migration takes >5 minutes and k8s kills the frontend because it's not responding), leaving the DB marked as being in a dirty state despite the migration eventually succeeding.

After this change, each migration will set dirty=false within the transaction to guarantee that when a migration succeeds the DB will be in a clean state.

More context:

Motivating prod error https://sourcegraph.slack.com/archives/C0J618TTM/p1579282575001900
Investigation https://sourcegraph.slack.com/archives/C07KZF47K/p1579287343014500
Filed an issue on golang-migrate Dirty DB despite successful migration golang-migrate/migrate#325
Submitted a PR to golang-migrate Maintain consistency in transactional DBs golang-migrate/migrate#326, probably won't get merged soon
Problem statement and alternatives https://sourcegraph.slack.com/archives/C07KZF47K/p1579546621063800

A few alternatives:

Accept that migrations can fail in this way and document how to fix the DB when this happens
Use my patched version of golang-migrate until a satisfactory solution has been upstreamed
Mitigation: increase the timeout for DB migrations
Don't merge slow migrations into master, somehow check perf before merging
Workaround 1: SET version=..., dirty=false at the end of each migration in our migrations/ directory
Workaround 2 (implemented in this PR): splice that statement into the migration at runtime by intercepting the migration loader
Workaround 3: inline the ~40 lines of relevant code from golang-migrate and patch that in our source

3 teammates supported workaround 3:

I vote for workaround 3, since we’re postgres only and the value-add from that lib is not that high. We can also do more postgres specific things if we go workaround 3 route and do stuff more tailored towards our exact needs.

Upon further consideration, it became more clear to me that workaround 2 would be better:

@eric and I chatted about this in depth and realized that workaround 3 is not very desirable because it would not play well with the migrate command line tool.

Workaround 2 is more compatible and doesn't change the control flow as much.

Given that, we plan to proceed with workaround 2.

This PR is based on https://github.com/sourcegraph/sourcegraph/pull/7905, which is expected to be merged before this PR.

unknwon · 2020-01-21T04:31:06Z

FYI, I changed the merge base branch to the clean-up-migrate-db for easier review (real change set for this PR):

Don't forget to change it back to master once #7905 is merged :)

internal/db/dbutil/dbutil.go

tsenart · 2020-01-21T11:04:25Z

internal/db/dbutil/dbutil.go

+			return nil, errors.New("expected migration filename to be of the form <version>_<name>")
+		}
+		version := versionParts[0]
+		newContents := strings.Replace(string(oldContents), "COMMIT;", fmt.Sprintf("UPDATE schema_migrations SET version='%s', dirty=false;\nCOMMIT;", version), 1)


I would expect golang-migrate to already update the version in the database. If that premise is valid, why do we need to re-update it? Would it be enough to just set dirty=false (and avoid all the version parsing code)?

Would it be enough to just set dirty=false (and avoid all the version parsing code)?

Oh right, yes, golang-migrate sets the version so we can drop the version parsing code.

No you can't!

If you set dirty=false but not the version, the frontend will not be around to update the version (necessarily) so you'll have applied the migration but the version will be the old one. This will cause the next frontend attempt to fail (forever).

@efritz golang-migrate sets the version and dirty=true prior to the migration.

https://github.com/golang-migrate/migrate/blob/v4.8.0/migrate.go#L739-L754

// set version with dirty state if err := m.databaseDrv.SetVersion(migr.TargetVersion, true); err != nil { return err } if migr.Body != nil { m.logVerbosePrintf("Read and execute %v\n", migr.LogString()) if err := m.databaseDrv.Run(migr.BufferedBody); err != nil { return err } } // set clean state if err := m.databaseDrv.SetVersion(migr.TargetVersion, false); err != nil { return err }

I believe that means the frontend doesn't need to be around to update the version.

I won't merge this in until we reach consensus here.

Oh, I see. I thought it only set dirty=true first, then dirty=false, version=target at the end.

I'm convinced you can remove it now.

codecov · 2020-01-21T21:00:41Z

Codecov Report

Merging #7909 into clean-up-migrate-db will increase coverage by <.01%.
The diff coverage is 62.5%.

@@                   Coverage Diff                   @@
##           clean-up-migrate-db    #7909      +/-   ##
=======================================================
+ Coverage                40.57%   40.57%   +<.01%     
=======================================================
  Files                     1279     1279              
  Lines                    66889    66896       +7     
  Branches                  6320     6320              
=======================================================
+ Hits                     27139    27144       +5     
- Misses                   37277    37278       +1     
- Partials                  2473     2474       +1

Impacted Files	Coverage Δ
internal/db/dbutil/dbutil.go	`17.81% <62.5%> (+2.24%)`	⬆️

chrismwendt · 2021-09-18T07:32:21Z

I merged this into the wrong branch 🤦 It never made it into master.

I'm attempting this again in (TODO make PR)

efritz · 2021-09-19T15:18:19Z

@chrismwendt I think there's a problem in this solution as well. ~~If the migration continues to run after the timeout (and the client disconnects), then it may be the case that a new frontend will try to re-attempt the migration and fail.~~ (This is incorrect since the dirty flag is still visible while that query is running).

This could cause a lot of noise during long migrations that still result in a CrashLoopBackoff that might eventually resolve without user intervention.

Like, it's a better state than we're in, but it's not nearly as good as what RFC 469's proposal guarantees (imo).

chrismwendt · 2021-09-19T19:53:09Z

@efritz It looks like golang-migrate acquires an advisory lock before checking the dirty state, so frontends won't see the dirty state:

If it did it the other way around, then on upgrade all but then frontend that first began the migration would throw errors about a dirty DB (which doesn't happen).

Although the frontend isn't around to explicitly release the lock, Postgres automatically releases it once the session ends (which I'm guessing is immediately after the SQL commands finish):

I tried interrupting migrations in a few different ways to see the behavior:

migration-interruptions.mp4

It seems to me that SET dirty = 'f' within the migration produces desirable behavior, but it's been a long time since I first looked at this. Am I misunderstanding anything here?

chrismwendt requested review from efritz and a team January 21, 2020 01:06

chrismwendt requested review from beyang and emidoots as code owners January 21, 2020 01:06

chrismwendt force-pushed the atomic-migrations branch from 20a35d5 to 668520f Compare January 21, 2020 01:10

efritz approved these changes Jan 21, 2020

View reviewed changes

chrismwendt force-pushed the atomic-migrations branch 2 times, most recently from 476af06 to c2f18e8 Compare January 21, 2020 03:23

unknwon changed the base branch from master to clean-up-migrate-db January 21, 2020 04:28

unknwon approved these changes Jan 21, 2020

View reviewed changes

internal/db/dbutil/dbutil.go Outdated Show resolved Hide resolved

internal/db/dbutil/dbutil.go Outdated Show resolved Hide resolved

unknwon reviewed Jan 21, 2020

View reviewed changes

internal/db/dbutil/dbutil.go Show resolved Hide resolved

tsenart approved these changes Jan 21, 2020

View reviewed changes

chrismwendt added 5 commits January 21, 2020 12:26

atomic migrations

66711a6

better docstring

d8a150f

do not set the version (was redundant)

f342c50

unexport

6e1ed36

add unit test

d6e1c4b

chrismwendt force-pushed the atomic-migrations branch from cd86045 to d6e1c4b Compare January 21, 2020 20:50

chrismwendt merged commit c809444 into clean-up-migrate-db Jan 21, 2020

chrismwendt deleted the atomic-migrations branch January 21, 2020 21:22

chrismwendt mentioned this pull request Jan 22, 2020

Code intelligence tracking issue 3.13 #7625

Closed

61 tasks

chrismwendt mentioned this pull request Sep 18, 2021

migrations: dirty flag can linger even after a successful migration #25089

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Atomic migrations #7909

Atomic migrations #7909

Uh oh!

chrismwendt commented Jan 21, 2020 •

edited

Loading

Uh oh!

unknwon commented Jan 21, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tsenart Jan 21, 2020

Uh oh!

chrismwendt Jan 21, 2020

Uh oh!

efritz Jan 21, 2020

Uh oh!

chrismwendt Jan 21, 2020

Uh oh!

efritz Jan 21, 2020

Uh oh!

codecov bot commented Jan 21, 2020 •

edited

Loading

Uh oh!

chrismwendt commented Sep 18, 2021

Uh oh!

efritz commented Sep 19, 2021 •

edited

Loading

Uh oh!

chrismwendt commented Sep 19, 2021

Uh oh!

Uh oh!

Atomic migrations #7909

Atomic migrations #7909

Uh oh!

Conversation

chrismwendt commented Jan 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

unknwon commented Jan 21, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tsenart Jan 21, 2020

Choose a reason for hiding this comment

Uh oh!

chrismwendt Jan 21, 2020

Choose a reason for hiding this comment

Uh oh!

efritz Jan 21, 2020

Choose a reason for hiding this comment

Uh oh!

chrismwendt Jan 21, 2020

Choose a reason for hiding this comment

Uh oh!

efritz Jan 21, 2020

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Jan 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

chrismwendt commented Sep 18, 2021

Uh oh!

efritz commented Sep 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chrismwendt commented Sep 19, 2021

Uh oh!

Uh oh!

chrismwendt commented Jan 21, 2020 •

edited

Loading

codecov bot commented Jan 21, 2020 •

edited

Loading

efritz commented Sep 19, 2021 •

edited

Loading